arXiv:1507.08379v1 [cs.LG] 30 Jul 2015




VMF-SNE: Embedding for Spherical Data 

Mian Wang, Dong Wang, Member, IEEE


Abstract: T-SNE is a well-known approach to embedding
high-dimensional data and has been widely used in data visualization.
The basic assumption of t-SNE is that the data are unconstrained
in the Euclidean space and that local proximity can be modelled
by Gaussian distributions. This assumption does not hold for a
wide range of data types in practical applications, for instance
spherical data, for which local proximity is better modelled by
the von Mises-Fisher (vMF) distribution than by the Gaussian.
This paper presents a vMF-SNE embedding algorithm to embed
spherical data. An iterative process is derived to produce an
efficient embedding. Results on a simulated data set demonstrate
that vMF-SNE produces better embeddings than t-SNE for
spherical data.

Index Terms: data embedding, data visualization, t-SNE, von
Mises-Fisher distribution


I. Introduction 

HIGH-DIMENSIONAL data embedding is a challenging
task in machine learning and is important for many
applications, particularly data visualization. In principle, data
embedding involves projecting high-dimensional data to a
low-dimensional (often 2 or 3) space where the major structure
(distribution) of the data in the original space is mostly
preserved. Data embedding can therefore be regarded as a special
task of dimension reduction, with the objective function set to
preserve the structure of the data.

Various traditional dimension reduction approaches can be
used to perform data embedding, e.g., principal component
analysis (PCA) [1] and multi-dimensional scaling (MDS) [2].
PCA finds low-dimensional embeddings that preserve the data
covariance as much as possible. Classical MDS finds embeddings
that preserve inter-sample distances, which is equivalent to PCA
if the distance is Euclidean. Both PCA and MDS are simple to
implement and computationally efficient, and are guaranteed to
discover the true structure of data lying on or near a linear
subspace. Their shortcoming is that they are ineffective for data
within non-linear manifolds.
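
The equivalence between PCA and classical MDS under the Euclidean
distance can be checked numerically. The following is a minimal
sketch, not part of the original work: the function names
`pca_embed` and `classical_mds` and the toy data are illustrative
assumptions. The two embeddings are expected to agree up to a sign
flip of each coordinate axis.

```python
import numpy as np

def pca_embed(X, dim):
    """Project centred data onto its top `dim` principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def classical_mds(D, dim):
    """Classical MDS from a matrix of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]       # indices of the largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Toy data with clearly distinct variances along each axis (an assumption
# made so that the principal directions are well separated).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y_pca = pca_embed(X, 2)
Y_mds = classical_mds(D, 2)
```

Comparing `np.abs(Y_pca)` with `np.abs(Y_mds)` removes the sign
ambiguity of the eigenvectors.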

A multitude of non-linear embedding approaches have been
proposed. The first approach is to derive the global non-linear
structure from local proximity. For example, ISOMAP extends
MDS by calculating similarities of distant pairs based on
similarities of neighbouring pairs [3], [4]. The self-organizing
map (SOM), or Kohonen net, extends PCA and derives the
global non-linearity by simply ignoring distant pairs [5].
The same idea motivates the generative topographic mapping
(GTM) [6], where the embedding problem is cast as Bayesian
inference with an EM procedure. The local linear embedding
(LLE) follows the same idea but formulates the embedding
as local-structure learning based on linear prediction [7].
Another approach to deriving the global non-linear structure
involves various kernel learning methods, e.g., the semi-definite
embedding based on kernel PCA [8] and the colored
maximum variance unfolding (CMVU) [9].

A major problem of the above non-linear embedding methods
is that most of them are not formulated in a probabilistic
way, which leads to potential problems in generalizability.
The stochastic neighbor embedding (SNE) [10] attempts to
solve the problem. It models local proximity (neighbourhood)
of data in both the original and the embedding space by Gaussian
distributions, and the embedding process minimizes the
Kullback-Leibler (KL) divergence between the distributions in the
original space and the embedding space.
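
The SNE objective described above can be sketched in a few lines.
This is a simplified illustration rather than the published
algorithm: it assumes a single shared bandwidth `sigma` for all
points (SNE itself tunes one bandwidth per point), and the names
`sne_conditionals` and `kl_rows` are hypothetical.

```python
import numpy as np

def sne_conditionals(X, sigma):
    """Gaussian neighbour probabilities p_{j|i} with one shared bandwidth."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def kl_rows(P, Q, eps=1e-12):
    """Sum over i of KL(P_i || Q_i), skipping zero entries of P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))     # toy high-dimensional data
Y = rng.normal(size=(20, 2))      # a random (untrained) embedding
P_high = sne_conditionals(X, 1.0)
Q_low = sne_conditionals(Y, 1.0)
cost = kl_rows(P_high, Q_low)     # the quantity SNE would minimize over Y
```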

A potential drawback of SNE is the ‘crowding problem’,
i.e., the data samples tend to be crowded together in the
embedding space [11]. A UNI-SNE approach was proposed
to deal with the problem, which introduces a symmetric cost
function and a smooth model when computing similarities
between the images (embeddings) of data in the embedding
space [12]. With the same problem in mind, [11] proposed
t-SNE, which also uses a symmetric cost function but employs
a Student t-distribution rather than a Gaussian distribution to
model similarities between images. T-SNE has shown clear
superiority over other embedding methods, particularly for data
that lie within several different but related low-dimensional
manifolds.
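
To illustrate why a heavy-tailed kernel eases crowding, the short
sketch below compares the unnormalized Gaussian and Student t
similarities at a few distances. The distances and the unit
bandwidth are arbitrary choices made for illustration only.

```python
import numpy as np

d = np.array([1.0, 2.0, 4.0, 8.0])       # pairwise distances in the embedding
gauss = np.exp(-d ** 2 / 2.0)            # unnormalized Gaussian similarity (sigma = 1)
student = 1.0 / (1.0 + d ** 2)           # unnormalized Student t, one degree of freedom

for dist, g, s in zip(d, gauss, student):
    print(f"d={dist:4.1f}  gaussian={g:.2e}  student_t={s:.2e}")
```

At large distances the Student t similarity decays polynomially
while the Gaussian decays exponentially, so moderately dissimilar
points can be placed far apart without an excessive penalty.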

Although highly effective in general, t-SNE is weak in
embedding data that are not Gaussian. For example, there are
many applications where the data are distributed on a
hyper-sphere, such as topic vectors in document processing [13]
and normalized i-vectors in speaker recognition [14]. Such
spherical data are naturally modelled by the von Mises-Fisher
(vMF) distribution rather than the Gaussian [15], [16], [17],
and hence are unsuitable to be embedded by t-SNE. This
paper presents a vMF-SNE algorithm to embed spherical
data. Specifically, the Gaussian distribution and the Student
t-distribution used by t-SNE in the original and the embedding
space, respectively, are both replaced by vMF distributions, and
an EM-based optimization process is derived to conduct the
embedding. The experimental results on simulation data show
that vMF-SNE produces better embeddings for spherical data.
The code is available online^1.

The rest of the paper is organized as follows. Section II
describes the related work, and Section III presents the vMF-SNE
algorithm. The experiment is presented in Section IV and the
paper is concluded in Section V.

^1 http://cslt.riit.tsinghua.edu.cn/resources.php?Public%20tools

II. Related work

This work belongs to the extensively studied area of dimension
reduction and data embedding. Most of the related
work in this field has been mentioned in the previous section.
In particular, our work is motivated by t-SNE [11], and is
designed specifically to embed spherical data, which are not
suitable to be processed by t-SNE. A more closely related work is
parametric embedding (PE) [18], which embeds vectors
of posterior probabilities and thus shares a similar goal with our
proposal: both attempt to embed data in a constrained space,
though the constraints are different ($\ell_1$ in PE and $\ell_2$
in vMF-SNE).

Probably the most relevant work is the spherical semantic
embedding (SSE) [19]. In the SSE approach, document vectors
and topic vectors are constrained to a unit sphere and are
assumed to follow the vMF distribution. The topic model and
the embedding model are then jointly optimized in a generative
model framework by maximum likelihood. However, SSE infers
local similarities between data samples (document vectors
in [19]) using a pre-defined latent structure (topic vectors),
which is difficult to generalize to other tasks because the latent
structure is not available in most scenarios. Additionally, the
cost function of SSE is the likelihood, while vMF-SNE uses
the symmetric KL divergence.


III. vMF-distributed stochastic neighbouring embedding

A. t-SNE and its limitation

Let $\{x_i\}$ denote the data set in the high-dimensional space,
and $\{y_i\}$ denote the corresponding embeddings, or images.
The t-SNE algorithm measures the pairwise similarities in the
high-dimensional space as the joint distribution of $x_i$ and $x_j$,
which is assumed to be Gaussian:

$$p_{ij} = \frac{e^{-\|x_i - x_j\|^2 / 2\sigma^2}}{\sum_{m \neq n} e^{-\|x_m - x_n\|^2 / 2\sigma^2}}.$$

In the embedding space, the joint probability of $y_i$ and $y_j$
is modelled by a Student t-distribution with one degree of
freedom:

$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{m \neq n} (1 + \|y_m - y_n\|^2)^{-1}}.$$

The cost function of the embedding is the KL divergence
between $p_{ij}$ and $q_{ij}$:

$$KL(P \| Q) = \sum_i \sum_j p_{ij} \ln \frac{p_{ij}}{q_{ij}}. \tag{1}$$

A gradient descent approach has been devised to conduct
the optimization, which is fairly efficient [11]. Additionally,
the symmetric form of Eq. (1) and the long-tail property of the
Student t-distribution alleviate the crowding problem from which
the original SNE and other embedding approaches suffer.

The assumption made by t-SNE deserves highlighting: the
joint probabilities of the original data and the embeddings
follow a Gaussian distribution and a Student t-distribution,
respectively. This is generally fine in most scenarios; however, for
data that are confined in a non-linear subspace, this assumption
is potentially invalid and the t-SNE embedding is no longer
optimal. This paper focuses on spherical data embedding,
for which t-SNE tends to fail. This is because the Gaussian
distribution assumed by t-SNE can hardly model spherical
data, and the Euclidean distance associated with Gaussian
distributions is not appropriate for measuring similarities on a
hyper-sphere. A new embedding algorithm is proposed, which
shares the same embedding framework as t-SNE, but uses
a more appropriate distribution form and a more suitable
similarity measure to model spherical data.

B. vMF-SNE

It has been shown that the vMF distribution is a better
choice than the Gaussian in modelling spherical data, and that the
associated cosine distance is better than the Euclidean distance
when measuring similarities in a hyper-spherical space, for
instance in tasks such as spherical data clustering [20], [21].
Therefore, we present an embedding method based on the
assumption that the data in both the original and the embedding
space follow vMF distributions. This new method is thus called
‘vMF-SNE’.

Mathematically, the probability density function of the vMF
distribution on the $(d-1)$-dimensional sphere in $\mathbb{R}^d$
is given by:

$$f_d(x; \mu, \kappa) = C_d(\kappa)\, e^{\kappa \mu^T x}, \tag{2}$$

where $\|x\| = \|\mu\| = 1$, $\kappa > 0$ and $\mu$ are the
parameters of the distribution, and $C_d(\kappa)$ is a normalization
constant. Note that the vMF distribution implies the cosine distance.
As in t-SNE, the symmetric distance is used in both the original and
the embedding space. In the original space, define the conditional
probability of $x_j$ given $x_i$ as:

$$p_{j|i} = \frac{f_d(x_j; x_i, \kappa_i)}{\sum_{n \neq i} f_d(x_n; x_i, \kappa_i)}, \tag{3}$$

the joint distribution $p_{ij}$ is then defined as follows, where
$N$ is the number of data points:

$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}. \tag{4}$$
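
The construction of the joint distribution in the original space
can be sketched as follows. This is an illustrative sketch, not
the authors' code: it assumes that the conditional probability of
Eq. (3) is normalized over all $j \neq i$, so that the constant
$C_d(\kappa_i)$ cancels, and it takes the per-point concentrations
`kappas` as given. The function name `vmf_joint_p` is hypothetical.

```python
import numpy as np

def vmf_joint_p(X, kappas):
    """Joint similarities p_ij from unit-norm rows of X and per-point kappa_i."""
    n = X.shape[0]
    logits = kappas[:, None] * (X @ X.T)          # kappa_i * x_i^T x_j
    np.fill_diagonal(logits, -np.inf)             # exclude j = i
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)       # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)            # symmetrization as in Eq. (4)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # project onto the unit sphere
P = vmf_joint_p(X, kappas=np.full(6, 10.0))
```

With this normalization the entries of `P` sum to one, which is
the property used later in the gradient derivation.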


In the embedding space, a simpler form of joint distribution
is chosen by setting the concentration parameter $\kappa_i$ to the
same value $\kappa$ for all $y_i$. This choice follows t-SNE. The
rationale is that the distribution in the original space needs to be
adjusted according to the data scattering around $x_i$, whereas doing
so in the embedding space would cause unaffordable computational
complexity, as we will see shortly. The joint distribution $q_{ij}$
with this simplification is given by:

$$q_{ij} = \frac{e^{\kappa y_i^T y_j}}{\sum_{m \neq n} e^{\kappa y_m^T y_n}}. \tag{5}$$
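
A minimal sketch of Eq. (5) is given below, assuming unit-norm
embeddings stored as the rows of `Y`. The function name
`vmf_joint_q` and the toy data are illustrative assumptions.

```python
import numpy as np

def vmf_joint_q(Y, kappa):
    """Joint similarities q_ij in the embedding space with one shared kappa."""
    logits = kappa * (Y @ Y.T)                    # kappa * y_i^T y_j
    np.fill_diagonal(logits, -np.inf)             # exclude pairs with i = j
    Q = np.exp(logits - logits.max())             # shift for numerical stability
    return Q / Q.sum()                            # normalize over all pairs m != n

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)     # points on the 2-sphere
Q = vmf_joint_q(Y, kappa=2.0)
```

Because a single $\kappa$ is shared, the normalizer is one scalar
for the whole matrix, and pairs with a larger cosine similarity
always receive a larger $q_{ij}$.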


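As in t-SNE, the KL divergence between the two distributions
is used as the cost function:

$$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \ln \frac{p_{ij}}{q_{ij}}.$$

Minimizing $C$ with respect to $\{y_i\}$ by gradient descent
leads to the optimal embedding. The gradients are derived
in the following section.

C. Gradient derivation

First note that

$$C = \sum_{i,j} p_{ij} \ln p_{ij} - \sum_{i,j} p_{ij} \ln q_{ij}.$$

Since the first term on the right-hand side of the equation
is independent of the embedding, minimizing $C$ is equivalent to
maximizing the following objective function:

$$\mathcal{L} = \sum_{i,j} p_{ij} \ln q_{ij}.$$

Defining $Z = \sum_{m \neq n} e^{\kappa y_m^T y_n}$, we have:

$$\mathcal{L} = \kappa \sum_{i,j} p_{ij}\, y_i^T y_j - \ln Z,$$

where $\sum_{i,j} p_{ij} = 1$ has been employed. The gradient of
$\mathcal{L}$ with respect to the embedding $y_k$ is then derived as:

\begin{align}
\frac{\partial \mathcal{L}}{\partial y_k}
  &= 2\kappa \sum_i p_{ik}\, y_i - \frac{\partial \ln Z}{\partial y_k} \tag{6} \\
  &= 2\kappa \sum_i p_{ik}\, y_i - \frac{1}{Z} \frac{\partial Z}{\partial y_k} \tag{7} \\
  &= 2\kappa \sum_i (p_{ik} - q_{ik})\, y_i, \tag{8}
\end{align}

where the last step uses
$\partial Z / \partial y_k = 2\kappa \sum_{i \neq k} e^{\kappa y_i^T y_k}\, y_i$
together with the definition of $q_{ik}$ in Eq. (5).

This is a rather simple form and the computation is efficient.
Note that this simplicity is partly due to the identical $\kappa$ in
the embedding space; otherwise the computation would be very
demanding.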
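
The closed-form gradient of Eq. (8) can be checked against
central finite differences. The sketch below is illustrative only:
the target matrix `P` is a random symmetric matrix with zero
diagonal that sums to one (an assumption standing in for real
data), and the embeddings are treated as unconstrained vectors
when differentiating, as in the derivation above.

```python
import numpy as np

def objective(Y, P, kappa):
    """L = kappa * sum_ij p_ij y_i^T y_j - ln Z, with Z summed over i != j."""
    G = kappa * (Y @ Y.T)
    off = ~np.eye(len(Y), dtype=bool)
    return float(np.sum(P * G) - np.log(np.sum(np.exp(G[off]))))

def gradient(Y, P, kappa):
    """Eq. (8): row k is 2 * kappa * sum_i (p_ik - q_ik) y_i."""
    G = kappa * (Y @ Y.T)
    np.fill_diagonal(G, -np.inf)
    Q = np.exp(G - G.max())
    Q /= Q.sum()
    return 2.0 * kappa * (P - Q) @ Y

rng = np.random.default_rng(0)
n, kappa = 5, 2.0
Y = rng.normal(size=(n, 3))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
A = rng.random((n, n))
np.fill_diagonal(A, 0.0)
P = (A + A.T) / (A + A.T).sum()       # symmetric, zero diagonal, sums to 1

# Central finite differences over every coordinate of Y.
eps = 1e-6
num = np.zeros_like(Y)
for k in range(n):
    for d in range(Y.shape[1]):
        E = np.zeros_like(Y)
        E[k, d] = eps
        num[k, d] = (objective(Y + E, P, kappa) - objective(Y - E, P, kappa)) / (2 * eps)
max_err = np.max(np.abs(num - gradient(Y, P, kappa)))
```

A small `max_err` indicates that the analytic and numerical
gradients agree.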

Algorithm 1 illustrates the vMF-SNE process. Notice that in
the original data space, $\kappa_i$ is required. Following [11],
$\kappa_i$ is set to a value that makes the perplexity $V_i$ equal to
a pre-defined value $V$, where $V_i$ is formulated by:

$$V_i = 2^{H(P_{j|i})}, \tag{9}$$

and $H(\cdot)$ is the information entropy defined by:

$$H(P_{j|i}) = -\sum_j p_{j|i} \log_2 p_{j|i},$$

where $p_{j|i}$ has been defined in Eq. (3). As mentioned in [11],
making the perplexity associated with each data point the same
value normalizes the data scattering and so benefits outliers
and crowded areas.
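
One way to find the concentration that matches a target
perplexity is bisection, since the perplexity of the conditional
distribution decreases as the concentration grows. The text does
not specify the search procedure, so the bisection below, its
bounds `lo` and `hi`, and the function names are assumptions made
for illustration.

```python
import numpy as np

def perplexity(cos_row, kappa):
    """Perplexity 2**H of p_{j|i} for one point, given its cosines to the others."""
    logits = kappa * cos_row
    p = np.exp(logits - logits.max())
    p /= p.sum()
    h = -np.sum(p * np.log2(np.maximum(p, 1e-300)))   # entropy in bits
    return 2.0 ** h

def find_kappa(cos_row, target, lo=0.0, hi=1e4, iters=60):
    """Bisection on kappa_i; perplexity shrinks as kappa grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if perplexity(cos_row, mid) > target:
            lo = mid          # distribution still too flat: raise kappa
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)
cos_row = np.delete(X @ X[0], 0)      # cosines from point 0 to all other points
kappa_0 = find_kappa(cos_row, target=40.0)
```

The same search would be repeated for every data point to obtain
the full set of concentrations.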


IV. Experiment 

To evaluate the proposed method, we employ vMF-SNE to 
visualize spherical data and compare it with the traditional 
t-SNE. Since visualization is not a quantitative evaluation, 
an entropy-based criterion is proposed to compare the two 
embedding approaches. 


Algorithm 1 vMF-SNE

Input:
  $\{x_i; \|x_i\| = 1, i = 1, ..., N\}$: data to embed
  $V$: perplexity in the original space
  $\kappa$: concentration parameter in the embedding space
  $T$: number of iterations
  $\rho$: learning rate
Output:
  $\{y_i; \|y_i\| = 1, i = 1, ..., N\}$: data embeddings
Procedure:
   1: compute $\{\kappa_i\}$ according to Eq. (9)
   2: compute $p_{ij}$ according to Eq. (4), and set $p_{ii} = 0$
   3: randomly initialize $\{y_i\}$
   4: for $t = 1$ to $T$ do
   5:   compute $q_{ij}$ according to Eq. (5)
   6:   for $i = 1$ to $N$ do
   7:     $S_i = \partial \mathcal{L} / \partial y_i$ according to Eq. (8)
   8:     $y_i = y_i + \rho S_i$
   9:   end for
  10: end for
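
The iteration of Algorithm 1 can be sketched as below for a
precomputed joint matrix `P`. This is an illustrative sketch and
not the released implementation. In particular, re-projecting
each embedding onto the unit sphere after every update is an
assumption added here to keep the output unit-norm; the function
name `vmf_sne`, the default parameter values and the toy target
are likewise hypothetical.

```python
import numpy as np

def vmf_sne(P, dim=3, kappa=2.0, iters=200, rate=1.0, seed=0):
    """Gradient ascent of Algorithm 1 given a precomputed joint matrix P."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = rng.normal(size=(n, dim))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)       # random start on the sphere
    for _ in range(iters):
        G = kappa * (Y @ Y.T)
        np.fill_diagonal(G, -np.inf)
        Q = np.exp(G - G.max())
        Q /= Q.sum()                                    # q_ij as in Eq. (5)
        S = 2.0 * kappa * (P - Q) @ Y                   # gradient as in Eq. (8)
        Y = Y + rate * S                                # ascent step on the objective
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # assumption: re-project to unit norm
    return Y

# Toy target: two groups of 10 points that are similar only within their own group.
labels = np.repeat([0, 1], 10)
P = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = vmf_sne(P)
```

All rows are updated at once here, which is equivalent to the
inner loop over $i$ because every $S_i$ is computed from the same
$q_{ij}$.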


A. Simulation data 

The experiments are based on simulation data. The basic
idea is to sample $k$ clusters of data and examine whether
the cluster structure is preserved after embedding. The
sampling process starts from the centers of the $k$ clusters,
i.e., $\{\mu_i; \|\mu_i\| = 1, i = 1, ..., k\}$. Although the centers
$\mu_i$ could be sampled independently, we adopt a different
approach: first sample the first center $\mu_1$, and then derive
the other centers $\{\mu_i\}$ by randomly selecting a subset of the
dimensions of $\mu_1$ and flipping the signs of the values on these
dimensions. In this way, the centers $\{\mu_i\}$ are ensured to be
separated on the hyper-sphere, which generates a clear cluster
structure associated with the data.
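
The sign-flipping construction of the cluster centers can be
sketched as follows. The text does not state how the first center
is drawn or how many dimensions are flipped, so the Gaussian draw
and the fixed subset size `n_flip` are assumptions, as is the
function name `make_centers`.

```python
import numpy as np

def make_centers(k, dim, n_flip, seed=0):
    """Draw one unit-norm centre, then derive the rest by flipping signs."""
    rng = np.random.default_rng(seed)
    first = rng.normal(size=dim)
    first /= np.linalg.norm(first)                           # first centre on the sphere
    centers = [first]
    for _ in range(k - 1):
        flip = rng.choice(dim, size=n_flip, replace=False)   # dimensions to negate
        mu = first.copy()
        mu[flip] *= -1.0
        centers.append(mu)
    return np.stack(centers)

centers = make_centers(k=4, dim=50, n_flip=25)
```

Flipping signs preserves the norm, so every derived center stays
on the unit sphere while its cosine to the first center drops
below one.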

Once the cluster centers are generated, it is easy to sample
the data points for each cluster following the vMF distribution.
A toolkit provided by Arindam Banerjee and Suvrit Sra was
adopted to conduct the vMF sampling^2. In this work, the
dimension of the data is set to 50, and 800 data points are
sampled in total. The concentration parameter $\kappa$ used in the
sampling is varied in order to investigate the performance of
the embedding approaches under different overlapping conditions.
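
For readers without access to the toolkit, vMF samples can also
be drawn with the rejection sampler of Wood (1994). The sketch
below is a substitute for illustration, not the toolkit used in
the experiments; the function name `sample_vmf` and the example
center are hypothetical.

```python
import numpy as np

def sample_vmf(mu, kappa, n, seed=0):
    """Draw n samples from vMF(mu, kappa) on the unit sphere in R^d (d >= 2, kappa > 0)."""
    rng = np.random.default_rng(seed)
    d = mu.shape[0]
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0 ** 2)
    w = np.empty(n)
    filled = 0
    while filled < n:                                  # rejection sampling of w = mu^T x
        z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
        cand = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * cand + (d - 1) * np.log(1.0 - x0 * cand) - c >= np.log(rng.random()):
            w[filled] = cand
            filled += 1
    v = rng.normal(size=(n, d - 1))
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # uniform tangent directions
    # Samples around the last coordinate axis e_d.
    x = np.hstack([np.sqrt(1.0 - w ** 2)[:, None] * v, w[:, None]])
    e_d = np.zeros(d)
    e_d[-1] = 1.0
    u = e_d - mu
    if np.linalg.norm(u) > 1e-12:                      # Householder reflection e_d -> mu
        u /= np.linalg.norm(u)
        x = x - 2.0 * np.outer(x @ u, u)
    return x

center = np.ones(50) / np.sqrt(50.0)                   # hypothetical cluster centre
samples = sample_vmf(center, kappa=15.0, n=200)
```

The scalar loop is kept for readability; a vectorized rejection
step would be preferable for large sample sizes.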

B. Visualization test 

The first experiment visualizes the spherical data with vMF-SNE.
The perplexity $V$ is set to 40, and the value of $\kappa$ in the
embedding space is fixed to 2 (see Algorithm 1). The data are
generated following vMF distributions by setting the concentration
parameter $\kappa$ to different values. Fig. 1 presents the embedding
results on 3-dimensional spheres with vMF-SNE, where the
two pictures show the results with $\kappa = 15$ and $\kappa = 40$,
respectively. Note that the $\kappa$ here is the one used in data
sampling, neither the $\kappa_i$ used to model the original data
(which is computed from $V$ for each data point) nor the $\kappa$
used to model the embedding data (which has been fixed to 2). It can
be seen that vMF-SNE indeed preserves the cluster structure of the
data in the embedding space and, not surprisingly, data generated
with a larger $\kappa$ are more separated in the embedding space.

^2 http://suvrit.de/work/soft/movmf

Fig. 1: The 3-dimensional embedding with vMF-SNE, with
data generated following a vMF distribution by setting
$\kappa = 15$ (left) and $\kappa = 40$ (right). The original
dimension is 50, and there are 4 clusters, each of which is
represented by a particular color.

Fig. 2: The 3-dimensional embedding with vMF-SNE (left)
and 2-dimensional embedding with t-SNE (right). The data
were generated following a vMF distribution by setting
$\kappa = 15$.


For comparison, the same data are embedded with t-SNE
in 2-dimensional space. The tool provided by Laurens van
der Maaten is used to conduct the embedding^3, where the
perplexity is set to 40. The comparative results are shown
in Fig. 2 and Fig. 3 for data generated by setting $\kappa = 15$ and
$\kappa = 10$, respectively. It can be observed that when $\kappa$
is large (Fig. 2), both vMF-SNE and t-SNE perform well and the
cluster structure is clearly preserved. However, when $\kappa$ is
small (Fig. 3), vMF-SNE shows clear superiority. This suggests
that t-SNE is capable of modelling spherical data if the structure
is clear, even if the underlying distribution is non-Gaussian;
however, in the case where the structure is less discernible in
the high-dimensional space, t-SNE tends to blur the cluster
boundaries while vMF-SNE still works well.

C. Entropy and accuracy test 

Visualization is not quantitative. For further investigation,
we propose to use the clustering accuracy and entropy as
the criteria to measure the quality of the embedding. This is
achieved by first finding the images of the cluster centers, and
then classifying the data according to their distances to the
centers in the embedding space. The classification accuracy
is computed as the proportion of the data that are correctly
classified. The entropy of the $i$-th cluster is computed as

$$H(i) = -\sum_j c(i,j) \ln c(i,j),$$

where $c(i,j)$ is the proportion of the data points that are
generated from the $j$-th cluster but are classified as the $i$-th
cluster in the embedding space. The entropy of the entire data set
is computed as the average of $H(i)$ over all the clusters. Table I
presents the results. It can be observed that in the case of
4 clusters, vMF-SNE achieves lower entropy and better accuracy
than t-SNE when $\kappa$ is small. If $\kappa$ is large, both methods
achieve good performance, for the reason discussed above.

Fig. 3: The 3-dimensional embedding with vMF-SNE (left)
and 2-dimensional embedding with t-SNE (right). The data
were generated following a vMF distribution by setting
$\kappa = 10$.

TABLE I: Results of Entropy and Accuracy

                    Entropy               Accuracy
  $\kappa$      t-SNE    vMF-SNE      t-SNE     vMF-SNE

  4 Clusters
  10            0.6556   0.5922       42%       64.13%
  20            0.4725   0.4187       85.38%    92.63%
  30            0.3804   0.3676       97.38%    98.5%
  40            0.3485   0.3466       99.75%    99.95%

  16 Clusters
  10            0.3152   0.2975       15.5%     16.88%
  20            0.2812   0.2608       38.25%    40.75%
  30            0.2312   0.2383       68.25%    55.13%
  40            0.1964   0.2187       91.25%    60.63%
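
The accuracy and entropy criteria can be sketched as below. Two
details are assumptions rather than statements from the text: the
image of each cluster center is taken as the mean of the embedded
points generated from that cluster, and $c(i,j)$ is normalized by
the total number of points. The second assumption yields
$\frac{1}{4}\ln 4 \approx 0.3466$ for four perfectly separated,
equally sized clusters, which agrees with the smallest 4-cluster
entropy reported in Table I. Euclidean distance is used for the
nearest-center rule; the function name is hypothetical.

```python
import numpy as np

def accuracy_and_entropy(Y, labels, k):
    """Nearest-centre classification accuracy and average cluster entropy."""
    # Assumed centre images: mean embedding of the points from each true cluster.
    centers = np.stack([Y[labels == j].mean(axis=0) for j in range(k)])
    dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=-1)
    pred = dist.argmin(axis=1)
    accuracy = float(np.mean(pred == labels))
    counts = np.zeros((k, k))
    np.add.at(counts, (pred, labels), 1.0)        # counts[i, j]: from cluster j, classified as i
    c = counts / len(labels)                      # assumed normalization by the total N
    terms = np.where(c > 0, c * np.log(np.where(c > 0, c, 1.0)), 0.0)
    entropy = float(np.mean(-terms.sum(axis=1)))  # average of H(i) over clusters
    return accuracy, entropy

# Toy check: four tight, well-separated clusters of 50 points each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
corners = 10.0 * np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
Y = corners[labels] + 0.1 * rng.normal(size=(200, 2))
acc, ent = accuracy_and_entropy(Y, labels, k=4)
```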

In the case of 16 clusters, it is observed that vMF-SNE
outperforms t-SNE with small $\kappa$ values (large overlaps). This
is an interesting property and demonstrates that using the
matched distribution (vMF) helps improve the embedding
of overlapped data. However, as $\kappa$ increases, vMF-SNE cannot
reach a performance as good as that obtained by t-SNE.
A possible reason is that the large number of clusters leads to
data crowding, which is better addressed by the long-tail
Student t-distribution used by t-SNE. Nevertheless, this
requires further investigation.

V. Conclusions 

A vMF-SNE algorithm has been proposed for embedding
high-dimensional spherical data. Compared with the widely
used t-SNE, vMF-SNE assumes vMF distributions and cosine
similarities for both the original data and the embeddings, and is
hence suitable for spherical data embedding. The experiments on a
simulation data set demonstrated that the proposed approach
works fairly well. Future work involves studying long-tail vMF
distributions to handle crowded data, as t-SNE does with the
Student t-distribution.


^3 http://lvdmaaten.github.io/tsne/